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Abstract 


Architectural  tactics  are  fundamental  design  decisions.  They  are  the  building  blocks  for  both  arc¬ 
hitectural  design  and  analysis.  A  catalog  of  architectural  tactics  has  now  been  in  use  for  several 
years  in  academia  and  industry.  This  report  illustrates  the  use  of  this  catalog  in  industrial  applica¬ 
tions,  describing  how  tactics  can  be  used  in  both  design  and  analysis.  The  report  further  shows 
how  the  needs  of  practice  have  caused  the  catalog  of  availability  tactics  to  be  updated,  but  demon¬ 
strates  that  the  underlying  structure  of  the  tactics  categorization  has  remained  stable.  Finally,  a 
real-world  example  is  provided  of  the  application  of  the  updated  set  of  availability  tactics,  show¬ 
ing  how  applying  tactics  illuminates  design  decisions,  as  guided  by  associated  heuristics  and  ana¬ 
lytic  models. 
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1  Introduction 


The  process  of  designing  and  analyzing  software  architectures  is  complex.  Architectures  are  de¬ 
termined  hy  requirements  (principally  quality  attribute  requirements),  which  are  in  turn  deter¬ 
mined  by  an  organization’s  business  goals  and  constraints.  But  moving  from  the  domain  of  re¬ 
quirements  to  the  domain  of  architecture  has  historically  been  an  art  more  than  a  science. 
Architectural  design  is  a  minimally  constrained  search  through  a  vast  multi-dimensional  space  of 
possibilities.  The  end  result  is  that  architects  are  seldom  confident  that  they  have  done  the  job  op¬ 
timally,  or  even  satisfactorily. 

Architectural  patterns  and  styles  have  been  proposed  as  a  way  to  manage  the  unconstrained  nature 
of  the  architectural  design  process  and  to  reduce  the  enormous  size  and  complexity  of  the  search 
space  [Garlan  1996,  Buschmann  1996].  Such  management  has  simplified  the  architectural  design 
process  somewhat,  but  it  is  still  a  challenge:  styles  and  patterns  have  replaced  one  form  of  black 
magic — architectural  design — with  another  form — choosing,  tailoring,  combining,  and  under¬ 
standing  patterns.  Patterns  are  complex  and  their  interactions  with  other  patterns  are  not  always 
clear.  Furthermore,  patterns  are  always  underspecified,  and  so  the  designer  still  needs  to  add  in 
considerable  amounts  of  detail  to  reify  these  into  an  implementable  design. 

Patterns  come  with  associated  rationale,  benefits,  and  liabilities  (e.g.,  the  Broker  pattern  is  re¬ 
ported  to  have  low  fault  tolerance,  “restricted”  efficiency,  and  to  be  difficult  to  test  and  debug). 
But  such  claims  are  contextual,  depending  on  many  environmental  factors  and  detailed  implemen¬ 
tation  decisions. 

We  have  proposed  a  more  fine-grained  approach  to  architectural  design,  employing  tactics  [Bass 
2003].  Tactics  are  the  building  blocks  of  architectures,  and  hence  the  building  blocks  of  architec¬ 
tural  patterns.  We  have  defined  sets  of  tactics  that  address  six  quality  attributes:  performance, 
usability,  availability,  modifiability,  testability,  and  security.  We  have  used  these  tactics  over  the 
past  five  years,  as  a  foundation  for  designing  and  analyzing  architectures.  So,  for  example,  tactics 
can  ameliorate  some  of  the  deficiencies  outlined  above  for  the  Broker  pattern.  The  low  fault  toler¬ 
ance  of  the  “vanilla”  Broker  pattern  could  be  ameliorated  by  using  some  form  of  active  redundan¬ 
cy,  for  instance.' 

Before  we  discuss  availability  tactics  in  detail  let  us  first  look  at  an  example  of  Modifiability  tac¬ 
tics  to  illustrate  their  impact  [Bachmann  2007].  There  are  three  classes  of  modifiability  tactics: 

1 .  those  that  defer  binding  time  decisions,  to  control  deployment  time  and  cost 

2.  those  that  help  to  localize  changes,  reducing  the  number  of  modules  directly  affected  by  a 
change 

3.  those  that  prevent  ripple  effects,  limiting  the  modifications  to  localized  modules 


As  will  be  described  in  Section  2.3,  in  one  form  of  the  active  redundancy  configuration,  a  group  of  processing 
nodes  comprises  both  active  and  redundant  nodes  that  receive  and  process  identical  inputs  in  parallel,  main¬ 
taining  a  synchronous  state  and  enabling  instantaneous  recovery  and  repair. 
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To  make  an  architecture  more  modifiable,  the  designer  needs  to  select  and  realize  one  or  more 
tactics  from  this  set. 

Patterns  package  a  number  of  tactics.  Let  us  examine  the  most  common  architectural  pattern — the 
Layered  Pattern — to  see  how  this  works  in  practice.  Layers  group  together  similar  sets  of  functio¬ 
nality  and  separate  them  from  other  functions  that  are  expected  to  change  independently. 

Through  this  separation,  the  modifiability  of  the  system  is  expected  to  increase.  For  example, 
layering  is  often  used  to  insulate  a  system  from  changes  in  the  underlying  platform  (hardware  and 
software),  increasing  maintainability  while  reducing  integration  and  verification  costs.  This  has 
been  known  at  least  as  far  back  as  Dijkstra’s  THE  operating  system  [Dijkstra  1968].  To  bring 
about  this  insulation,  the  architect  creates  one  or  more  platform-specific  layers  to  abstract  the  de¬ 
tails  of  the  underlying  hardware  and  operating  system.  The  rest  of  the  system’s  functionality  then 
accesses  the  underlying  platform  via  these  abstractions.  To  achieve  this  effect,  the  Layered  Pattern 
employs  two  Localize  Change  tactics — Semantic  Coherence  and  Abstract  Common  Services — to 
increase  cohesion,  and  it  employs  three  Prevent  Ripple  Ejfects  tactics — Use  Encapsulation,  Use  an 
Intermediary,  and  Restrict  Communication  Paths — to  reduce  coupling  [Bachmann  2007]. 

What  is  the  point  of  this  more  granular  representation  of  design  operations?  A  tactic  is  a  design 
decision  that  is  influential  in  the  control  of  a  single  quality  attribute  response.  As  such  it  is  simple 
to  understand  and  analyze — its  properties  and  effects  are  well  understood.  A  pattern,  on  the  other 
hand,  is  a  prepackaged  solution  to  a  recurring  problem  that  resolves  multiple  forces.  Patterns  are 
more  complex  and  so  it  is  much  harder  to  understand  the  implications  of  changing  the  pattern.  To 
understand  a  pattern,  to  tailor  it,  and  to  analyze  it,  you  need  to  understand  the  tactics  from  which  it 
is  composed  and  the  effects  of  each  constituent  tactic. 

Returning  to  the  Layered  Pattern,  if  the  modifiability  of  the  system  with  respect  to  a  specific  re¬ 
sponsibility  needs  to  be  increased,  one  tactic  that  could  be  employed  is  Use  an  Intermediary. 
Employing  this  tactic  modifies  the  design:  the  independent  functionality  and  the  dependent  func¬ 
tionality  are  now  separated  by  a  third  component — the  intermediary.  Along  with  the  tactic  there  is 
an  analysis  model  (perhaps  encapsulated  in  a  “reasoning  framework”  [Bass  2005])  that  allows  the 
designer  and  the  analyst  to  reason  about  the  cost  of  a  change  both  with  and  without  the  interme¬ 
diary,  and  so  make  a  reasoned  decision.  The  tactic  requires  effort,  and  hence  cost,  to  implement 
and  maintain.  In  addition,  after  the  implementation  of  this  tactic,  the  strength  of  coupling  between 
the  dependent  and  independent  functionality  is  reduced. 

Every  design  decision  has  side  effects.  Once  the  Use  an  Intermediary  tactic  is  in  place,  it  will 
have  an  effect  on  runtime  performance.  Each  of  these  attributes — cost,  coupling,  and  perfor¬ 
mance  impact — can  be  estimated  by  the  architect  and  a  reasoned  decision  can  be  made  on  whether 
to  use  the  tactic. 

In  the  remainder  of  this  report,  we  will  show  how  tactics  are  used  in  practice  and  how  they  inform 
both  design  and  analysis.  In  particular  we  will  show  how  availability  tactics  have  been  used  and 
how  they  have  been  augmented  over  time  to  meet  the  needs  of  a  changing  world. 

1.1  Using  Tactics  in  Practice 

Tactics  can  be  used  in  both  design  and  in  analysis.  They  can  be  used  in  the  design  process  to 
make  decisions  or,  more  commonly,  to  modify  an  architectural  pattern.  In  this  way,  tactics  aid  in 


2  I  CMU/SEI-2009-TR-006 


enumerating  and  choosing  among  design  decisions.  Similarly,  tactics  can  be  used  in  analysis. 

Each  tactic  is  easier  to  understand  in  isolation  than  a  pattern  and,  as  described  by  Bachmann, 
analysis  models  can  be  associated  with  tactics  [Bachmann  2007].  For  example,  there  is  a  formula 
for  determining  the  increase  in  availability  by  adding  a  redundant  hot  spare.  The  architect,  using 
this  formula,  can  then  reason  about  the  costs  and  benefits  of  using  this  form  of  redundancy.  Simi¬ 
larly  the  Ping/Echo  tactic  can  be  analyzed  in  terms  of  how  long  it  takes  to  detect  a  fault  (based  on 
the  period  of  the  Ping  and  the  number  of  missed  Pings  before  a  fault  is  determined  to  have  oc¬ 
curred)  and  how  the  overhead  of  Ping  messages  degrades  end-to-end  latency  in  the  system.  To 
show  how  this  reasoning  is  supported,  a  specific  set  of  tactics — availability  tactics — is  now  dis¬ 
cussed  in  detail. 
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2  Tactics  for  Availability 


Availability  tactics  are  designed  to  enable  a  system  to  endure  system  faults  such  that  a  service 
being  delivered  by  the  system  remains  compliant  with  its  specification  [Bass  2003].  Inherent  in 
this  definition  is  the  distinction  between  a  system  fault  and  a  system  failure.  System  faults  are 
escalated  to  system  failures  once  services  are  impacted  to  the  point  where  they  no  longer  comply 
with  their  specifications.  In  operational  systems,  faults  are  detected  and  correlated  prior  to  being 
reported  and  repaired.  Fault  correlation  logic  will  categorize  a  fault  according  to  its  severity  (crit¬ 
ical,  major,  or  minor)  and  service  impact  (service  affecting  or  non-service  affecting)  in  order  to 
provide  the  system  operator  with  timely  and  accurate  system  status  and  allow  for  the  appropriate 
repair  strategy  to  be  employed.  The  repair  strategy  may  be  automated  or  may  require  manual  in¬ 
tervention. 

System  availability  builds  upon  the  concept  of  system  reliability  by  adding  the  notion  of  recovery, 
which  may  be  accomplished  by  fault  masking,  repair,  or  component  replacement  [Jalote  1994].  In 
practice,  system  requirements  for  availability  are  developed  in  accordance  with  steady-state  avail¬ 
ability  (as  opposed  to  instantaneous  availability).  Steady-state  availability  is  the  measurement  of 
a  system’s  uptime  over  a  sufficiently  long  (90  days,  one  year,  total  mission,  etc.)  duration.  The 
well-known  expression  used  to  derive  steady-state  availability  is 

a  =  MTBF/(MTBF  +MTTR) 

Where  MTBF  refers  to  the  mean  time  between  failures  (derived  based  on  the  expected  value  of 
the  system’s  fault  probability  density  function)  and  MTTR  refers  to  the  mean  time  to  repair 
(which  varies  according  to  the  repair  strategy  employed). 


Table  1:  System  Availability  Requirements 


Availability 

Downtime/90  Days 

Downtime/Year 

99.0  % 

21  hr,  36  min 

3  days,  15.6  hr 

99.9  % 

2  hr,  10  min 

8  hr,  0  min,  46  sec 

99.99  % 

12  min,  58  sec 

52  min,  34  sec 

99.999  % 

1  min,  18  sec 

5  min,  15  sec 

99.9999  % 

8  sec 

32  sec 

In  practice,  system  designers  develop  a  fault  tree  to  characterize  system  faults  according  to  their 
severity  and  service  impact,  and  identify  a  suitable  repair  strategy  for  each  branch  of  the  tree. 
Table  1  provides  an  example  of  typical  system  availability  requirements  and  associated  threshold 
values  for  acceptable  system  downtime,  measured  over  observation  periods  of  90  days  and  one 
year.  The  term  high  availability  typically  refers  to  designs  targeting  availability  of  99.999%  (“5 
nines”)  or  greater.  It  should  be  noted  that  by  definition,  only  unscheduled  outages  contribute  to 
system  downtime. 
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2.1  Updating  the  Tactics  Cataiog 


A  categorization  of  availability  tactics  provided  by  Bass  is  reproduced  in  Figure  1,  below  [Bass 
2003].  As  illustrated,  availability  tactics  are  categorized  according  to  whether  they  address  fault 
detection,  recovery,  or  prevention. 


We  will  review  the  availability  tactics  described  by  Bass  [Bass  2003]  and  then  show  a  new  ver¬ 
sion  of  this  categorization  that  has  been  augmented  and  refined,  based  on  several  years  of  indus¬ 
trial  experience  using  the  categorization  for  the  analysis  and  design  of  high-availability  systems. 

Figure  2  illustrates  the  refined  view  of  the  tactics  outlined  in  Figure  1 .  In  addition  to  refining  the 
categorization,  Figure  2  shows,  below  the  tactics,  some  examples  of  specific  implementation 
techniques  for  each. 

2.2  Fault  Detection  Tactics 

The  tactics  that  were  classified  as  being  for  fault  detection  were  Ping/Echo,  Heartbeat,  and  Ex¬ 
ception.  In  addition  to  these  tactics,  reported  by  Bass  [Bass  2003],  we  have  classified  Voting  as  a 
tactic  whose  primary  purpose  is  fault  detection. 

Ping/Echo  refers  to  an  asynchronous  request/response  message  pair  exchanged  between  nodes, 
used  to  determine  reachability  and  the  round-trip  delay  through  the  associated  network  path. 
Standard  implementations  of  Ping/Echo  are  available  for  nodes  interconnected  via  IP  {ICMP 
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[IETF  1981]  or  ICMPv6  [RFC  2006b]  Echo  Request/Response)  and  MPLS  (LSP  Ping)  networks 
[IETF  2006a]. 

In  addition  we  have  generalized  the  notion  of  the  Heartbeat  tactic  to  System  Monitor.  In  a  high- 
availability  system,  a  System  Monitor  tactic  is  used  to  monitor  state  of  health,  which  includes  the 
detection  of  hung  or  runaway  processes;  a  heartbeat  is  one  measure  of  health  that  a  System  Moni¬ 
tor  could  observe.  When  the  detection  mechanism  is  implemented  using  a  counter  or  timer  that  is 
periodically  reset,  this  specialization  of  System  Monitor  is  referred  to  as  a  Watchdog.  During  no¬ 
minal  operation,  the  process  being  monitored  will  periodically  reset  the  watchdog  counter/timer, 
commonly  referred  to  as  “petting  the  watchdog.” 


Fault 


Availability  Tactics 


Fault  Recovery 


Fault  Detection 


Ping  /  Echo 

ICMP/ICMPV6 
Echo  Req/Reply 

MPLS  Ping 

System  Monitor 
Watchdog 

Heartbeat 

Exception 

Detection 

System 

Exceptions 

Parameter 

Fence 

Parameter 

Typing 

Voting 

Tnple  Modular 
Redundancy 


Active 

Redundancy 

Facilities 

Proleciion 

Passive 

Redundancy 

Spare 

Exception 
Handling 
Error  Codes 

Exception 

Classes 

Software  Upgrade 
Function  Patch 

Class  Patch 
Mitless  ISSU 


Shadow 

Slate 

Resynchronizalion 

Graceful 

Restart 

Rollback 

Coordinated 

Checkpointing 

Uncoordinated 

Checkpointing 

Escalating 

Restart 

Non-Stop 

Forwarding 


Fault  Prevention 


Removal  from 
Service 

Transactions 

Atomic 

Commit 

Protocol 

Process 

Monitor 

Exception 

Prevention 

Exception 

Classes 

Smart  Rrs 

Wrappers 


Fault 

Masked 

or 

Repair 

Made 


Figure  2:  Refined  Availability  Tactics,  with  Examples 
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Table  2:  Availability  Tactics:  Fault  Detection 


Tactic 

Control/click  on  tactic  to  go 
to  its  description  in  report 

Mechanism 

Resuit 

Ping/Echo 

asynchronous  request/response  mes¬ 
sage  pair  exchanging  active  nodes 

determines  reachabiiity  and 
round-trip  deiay  through  the  as¬ 
sociated  network  path 

ICMP/ICMPV6 

Echo  Req/Reply 

MPLS  Ping 

System  Monitor 

In  high-availability  system,  moni¬ 
tors  state  of  system  health;  de¬ 
tects  hung  or  runaway  processes 

Watchdog 

Hardware-based  counter-timer  periodi- 
caiiy  reset  by  software 

Upon  expiration,  indicates  to  sys¬ 
tem  monitor  of  a  fault  incurrence 
in  the  process 

Heartbeat 

Periodic  message  exchange  between 
the  system  monitor  and  process 

Indicates  to  system  monitor  when 
fault  is  incurred  in  process 

Exception  Detection 

Detects  system  conditions  that 
alter  the  normal  flow  of  execution 

System 

Exceptions 

System  raises  an  exception  when  it 
detects  a  fauit 

Detects  faults  such  as  divide  by 
zero,  bus  and  address  faults,  il¬ 
legal  program  instructions 

Parameter 

Fence 

Incorporates  an  a  priori  data  pattern 
(such  as  Oxdeadbeef)  piaced  imme- 
diateiy  after  any  variabie-iength  para¬ 
meters  of  an  object 

Allows  for  runtime  detection  of 
overwriting  the  memory  allocated 
for  the  object’s  variable-length 
parameters 

Parameter  Typing 

Empioys  a  base  ciass  that  defines 
functions  that  add,  find,  and  iterate 
over  Type-Length-Vaiue  (TLV)  format¬ 
ted  message  parameters 

Uses  strong  typing  to  build  and 
parse  messages 
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Tactic 

Control/click  on  tactic  to  go 
to  its  description  in  report 

Mechanism 

Resuit 

Voting 

Triple  Modular 
Redundancy 

Three  identical  processing  units,  each 
receiving  identical  inputs,  whose  output 
is  forwarded  to  voting  logic 

Detects  any  inconsistency  among 
the  three  output  states,  which  is 
treated  as  a  system  fault 

Expiration  of  the  watchdog  counter/timer  provides  an  indication  to  the  System  Monitor  that  the 
process  being  monitored  has  incurred  a  fault.  When  the  underlying  fault  detection  mechanism 
employs  a  periodic  message  exchange  between  the  System  Monitor  and  the  process  being  moni¬ 
tored,  this  is  referred  to  as  a  Heartbeat.  For  larger  systems  where  scalability  is  a  concern,  trans¬ 
port  and  processing  overhead  efficiency  can  be  increased  by  piggybacking  Heartbeat  messages  on 
to  other  control  messaging  being  exchanged  between  the  process  being  monitored  and  the  distri¬ 
buted  system  controller.  In  this  case,  there  is  an  added  dependency  between  the  Messaging  Sys¬ 
tem  and  the  System  Monitor.  Based  on  the  above  discussion,  we  revise  the  Fault  Detection  tactic 
category  described  by  Bass  [Bass  2003]  to  include  a  System  Monitor  tactic  that  is  further  refined 
to  include  a  Watchdog  and  a  Heartbeat  tactic. 

The  Voting  fault  detection  tactic  is  based  on  the  fundamental  contributions  to  automata  theory  by 
Von  Neumann,  who  demonstrated  how  systems  having  a  prescribed  reliability  could  be  built  from 
unreliable  components  [Von  Neumann  1956].  The  common  realization  of  this  tactic  is  referred  to 
as  Triple  Modular  Redundancy  (TMR),  which  employs  three  identical  processing  units,  each  of 
which  receives  identical  inputs,  and  forwards  their  output  to  voting  logic,  used  to  detect  any  in¬ 
consistency  among  the  three  output  states.  Any  such  inconsistency  is  treated  as  a  system  fault. 
TMR  depends  critically  on  the  voting  logic,  which  can  be  realized  either  as  a  singleton  where  the 
probability  of  error  is  sufficiently  low  (the  voting  logic  is  a  simple  Boolean  AND/OR  combina¬ 
tion)  or  as  a  redundant  triple  [Lyons  1962].  To  demonstrate  the  improvement  in  system  reliability 
(specifically,  MTBF)  of  a  TMR-based  design,  consider  a  system  where  the  probability  of  error  of 
a  single  bit  is  defined  as  8e.  Applying  TMR  to  that  single  bit  will  reduce  the  probability  of  error 
from  8e  to 


^TMR  ~  )  ■*■ 

For  example,  if  a  single  component  has  an  error  rate  of  .001,  a  TMR  version  of  this  component 
will  have  an  error  rate  of  0.000002998,  or  about  three  orders  of  magnitude  better.  TMR  is  com¬ 
monly  realized  at  the  hardware  gate  and  chip  level,  but  can  also  be  employed  in  software  at  the 
thread  or  process  level  for  scenarios  where  the  outputs  of  the  multiple  threads  or  processes  can  be 
synchronized  by  the  voting  logic. 

The  final  Fault  Detection  tactic  identified  by  Bass  is  the  Exception  tactic  [Bass  2003].  The  Ex¬ 
ception  tactic  can  be  further  refined  into  Exception  Detection,  Exception  Handling,  and  Exception 
Prevention  tactics.  Exception  Detection  refers  to  the  detection  of  a  system  condition  that  alters 
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the  normal  flow  of  execution.  For  distributed  real-time  embedded  systems,  the  Exception  Detec¬ 
tion  tactic  can  be  further  refined  to  include  System  Exceptions,  Parameter  Eence,  and  Parameter 
Typing  tactics.  System  Exceptions  will  vary  according  to  the  processor  hardware  architecture  em¬ 
ployed  and  include  faults  such  as  divide  by  zero,  bus  and  address  faults,  illegal  program  instruc¬ 
tions,  and  so  forth.  The  Parameter  Pence  tactic  incorporates  an  a  priori  data  pattern  (such  as 
OxDEADBEEF)  placed  immediately  after  any  variable -length  parameters  of  an  object.  This  al¬ 
lows  for  runtime  detection  of  overwriting  the  memory  allocated  for  the  object’s  variable-length 
parameters  [Utas  2005].  Parameter  Typing  employs  a  base  class  that  defines  functions  that  add, 
find,  and  iterate  over  Type-Length-Value  (TLV)  formatted  message  parameters.  Derived  classes 
use  the  base  class  functions  to  implement  functions  that  provide  Parameter  Typing  according  to 
each  parameter’s  structure.  Use  of  strong  typing  to  build  and  parse  messages  will  result  in  higher 
availability  than  implementations  that  simply  treat  messages  as  byte  buckets  [Utas  2005]. 

2.3  Fault  Recovery  Tactics 

Pault  Recovery  tactics  are  refined  into  Preparation  and  Repair  tactics  and  Reintroduction  tactics. 
Preparation  and  Repair  tactics  include  Active  Redundancy,  Passive  Redundancy,  Spare,  Excep- 
tion  Handling,  and  Software  Upgrade.  Reintroduction  tactics  include  Shadow,  Rollback,  Esca¬ 
lating  Restart,  and  Non-Stop  Porwarding. 

Fligh-availability  distributed  real-time  embedded  systems  commonly  employ  a  strategy  of  equip¬ 
ment  protection,  where  spatially  redundant  line  cards  (circuit  packs)  are  employed  in  a  hot,  warm, 
or  cold  sparing  configuration.  These  three  configurations  are  referred  to  by  Bass  as  active  redun¬ 
dancy  (hot  sparing),  redundancy  (warm  sparing),  and  simply  sparing  (cold  sparing)  [Bass 

2003].  In  our  updated  catalog  of  availability  tactics,  we  refer  to  cold  sparing  as  the  Spare  tactic. 
Before  describing  each  of  these  three  configurations,  we  first  define  a  protection  group  as  being  a 
group  of  processing  nodes  where  one  or  more  nodes  are  “active”  with  the  remaining  nodes  in  the 
protection  group  serving  as  redundant  spares. 

Active  Redundancy  refers  to  a  configuration  where  all  of  the  nodes  (active  or  redundant  spare)  in  a 
protection  group  receive  and  process  identical  inputs  in  parallel,  allowing  the  redundant  spare(s) 
to  maintain  synchronous  state  with  the  active  node(s).  Because  the  redundant  spare  possesses  an 
identical  state  to  the  active  processor,  recovery  and  repair  can  occur  in  time  measured  in  millise¬ 
conds.  The  simple  case  of  one  active  node  and  one  redundant  spare  node  is  commonly  referred  to 
as  1+1  (“one  plus  one”)  redundancy.  Active  Redundancy  can  also  be  used  for  Pacilities  Protec¬ 
tion,  where  active  and  standby  network  links  are  used  to  ensure  highly  available  network  connec¬ 
tivity.  Standards-based  realizations  of  Active  Redundancy  exist  for  protecting  network  links  (i.e., 
facilities)  at  both  the  physical  layer  [Bellcore  1998,  1999,  Telcordia  2000]  and  network/link  layer 
[lETE  2005]. 

Passive  Redundancy  refers  to  a  configuration  where  only  the  active  members  of  the  protection 
group  process  input  traffic,  with  the  redundant  spare(s)  receiving  periodic  state  updates.  Because 
the  state  maintained  by  the  redundant  spares  is  only  loosely  coupled  with  that  of  the  active 
node(s)  in  the  protection  group  (with  the  looseness  of  the  coupling  being  a  function  of  the  check- 

^  Many  tactics  span  multiple  categories.  For  example,  Voting  can  be  considered  a  Fault  Detection  tactic,  (detect¬ 
ing  a  dissenting  vote),  or  a  Fault  Preparation  and  Repair  tactic  (correlating  the  fault).  Voting  also  aids  in  Fault 
Prevention  (by  identifying  a  processor  to  be  reset  or  repaired).  Flowever,  in  the  taxonomy  we  have  chosen  to 
include  it  within  the  Fault  Detection  category. 
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pointing  mechanism  employed  between  active  and  redundant  nodes),  the  redundant  nodes  are  re¬ 
ferred  to  as  warm  spares.  Depending  on  a  system’s  availability  requirements,  Passive  Redundan¬ 
cy  provides  a  solution  that  achieves  a  balance  between  the  more  highly  available  but  more 
plex  Active  Redundancy  tactic  and  the  less  available  but  significantly  less  complex  Spare  tactic. 

Cold  sparing,  or  simply  sparing  in  the  nomenclature  of  Bass,  refers  to  a  configuration  where  the 
redundant  spares  of  a  protection  group  remain  out  of  service  until  a  fail-over  occurs,  at  which 
point  a  Power-On-Reset  procedure  is  initiated  on  the  redundant  spare  prior  to  its  being  placed  in 
service  [Bass  2003].  Due  to  its  poor  recovery  performance,  cold  sparing  is  better  suited  for  sys¬ 
tems  having  only  high-reliability  (MTBF)  requirements  as  opposed  to  those  also  having  high- 
availability  requirements. 

In  practice,  the  system  architect  will  determine  whether  to  use  Active  Redundancy,  Passive  Re¬ 
dundancy,  or  Spare  based  on  the  system  availability  requirements  allocated.  Figure  3  illustrates 
the  data  flow  for  each  of  these  three  tactics,  embedded  in  the  context  of  patterns. 

Recall  from  Section  2.2  that  the  Exception  tactic  can  be  refined  into  Exception  Detection,  Excep¬ 
tion  Handling,  and  Exception  Prevention  tactics,  with  Exception  Detection  being  discussed  in  that 
section.  The  mechanism  employed  for  Exception  Handling  depends  largely  on  the  programming 
environment  employed,  ranging  from  simple  function  return  codes  {Error  Codes)  to  the  use  of 
Exception  Classes  that  contain  information  helpful  in  fault  correlation,  such  as  the  name  of  the 
exception  thrown,  the  origin  of  the  exception,  and  the  cause  of  the  exception  thrown  [Powell- 
Douglas  1999]. 
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Redundant  Node 
(Cold  Spare) 


Input  Data 

(Cold )  Sparing  (Uve  Traffic) 


Figure  3:  System  Redundancy  Tactics  Embedded  in  Patterns 


Software  Upgrade  is  another  Preparation  and  Repair  tactic  whose  goal  is  to  achieve  in-service 
upgrades  to  executable  code  images  in  a  non-service-affecting  manner  [Scott  2008].  This  tactic  is 
refined  by  Function  Patch,  Class  Patch,  and  Hitless  In-Service  Software  Upgrade  (ISSU)  tactics. 
The  Function  Patch  tactic  is  used  in  a  procedural  programming  environment  and  employs  an  in¬ 
cremental  linker/loader  to  store  an  updated  software  function  into  a  pre-allocated  segment  of  tar¬ 
get  memory.  The  new  version  of  the  software  function  will  employ  the  entry  and  exit  points  of 
the  deprecated  function.  Also,  upon  loading  the  new  software  function,  the  symbol  table  must  be 
updated  and  the  instruction  cache  invalidated.  The  Class  Patch  tactic  is  applicable  for  targets  ex¬ 
ecuting  object-oriented  code,  where  the  class  definitions  include  a  backdoor  mechanism  that 
enables  the  runtime  addition  of  member  data  and  functions.  Hitless  In-Service  Software  Upgrade 
is  a  tactic  that  leverages  the  Active  Redundancy  or  Passive  Redundancy  tactics  to  achieve  non- 
service-affecting  upgrades  to  software  and  associated  schema.  In  practice,  the  Function  Patch 
and  Class  Patch  tactics  are  used  to  deliver  bug  fixes  while  the  Hitless  In-Service  Software  Up¬ 
grade  tactic  is  used  to  deliver  new  features  and  capabilities. 
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Table  3:  Availability  Tactics:  Fault  Recovery 


Tactic 

Control/click  on  tactic  to  go 
to  its  description  in  report. 

Mechanism 

Result 

Preparation  and 
Repair 

Active  Redundancy 

Configuration  wherein  all  of  the 
nodes  (active  or  redundant  spare)  in 
a  protection  group  receive  and 
process  identical  inputs  in  parallel 

Redundant  spare  possesses 
an  identical  state  to  the  ac¬ 
tive  processor,  so  recovery 
and  repair  can  occur  in  milli¬ 
seconds 

Passive  Redundancy 

Configuration  wherein  only  the  ac¬ 
tive  members  of  the  protection 
group  process  input  traffic,  with  the 
redundant  spare(s)  receiving  peri¬ 
odic  state  updates 

Achieves  a  balance  between 
the  more  highly  available  but 
more  complex  Active  Re¬ 
dundancy  tactic  and  the  less 
available  but  significantly 
less  complex  Spare  tactic 

Spare 

Configuration  wherein  the  redun¬ 
dant  spares  of  a  protection  group 
remain  out  of  service  until  a  fail-over 

occurs 

Initiates  power-on-reset  pro¬ 
cedure  on  the  redundant 
spare  prior  to  its  being 
placed  in  service 

Exception  Handiing 

Depends  largely  on  the  program¬ 
ming  environment  employed,  rang¬ 
ing  from  simple  function  return 
codes  {Error  Codes)  to  the  use  of 
Exception  Classes  that  contain  in¬ 
formation  helpful  in  fault  correlation 

Exception  Classes 

Contain  information  helpful  in  fault 
correlation,  such  as  the  name  of  the 
exception  thrown,  the  origin  of  the 
exception,  and  the  cause  of  the  ex¬ 
ception  thrown 

Allows  system  to  trans¬ 
parently  recover  from  system 
exceptions 

Software  Upgrade 

Achieves  in-service  up¬ 
grades  to  executable  code 
images  in  a  non-service  af¬ 
fecting 
manner 

Function  Patch 

An  incremental  linker/loader  in  a 
procedural  programming 
environment 

Stores  an  updated  software 
function  into  a  pre-allocated 
segment  of  target  memory 
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Tactic 

Control/click  on  tactic  to  go 
to  its  description  in  report. 

Mechanism 

Result 

Class  Patch 

Applicable  for  targets  executing  ob¬ 
ject-oriented  code,  where  the  class 
definitions  include  a  backdoor 
mechanism 

Enables  the  runtime  addition 
of  member  data  and 
functions 

Hitless  In-Service 
Software  Upgrade 
(ISSU) 

Leverages  the  Active  Redundancy 
or  Passive  Redundancy  tactics 

Achieves  non-service- 
affecting  upgrades  to  soft¬ 
ware  and  associated  schema 

Reintroduction 

Failed  component  is  reintro¬ 
duced  after  correction 

Shadow 

Operates  a  previously  failed  or  in- 
service  upgraded  component  in  a 
“shadow  mode”  for  a  pre-defined 
duration  of  time 

Reverts  the  component  back 
to  an  active  role 

State 

Resynchronization 

When  realized  as  a  refinement  to 
the  Active  Redundancy  tactic,  oc¬ 
curs  organically  as  active  and 
standby  components  each  receive 
and  process  identical  inputs  in 
parallel 

When  realized  as  a  refinement  to 
the  Passive  Redundancy  tactic,  is 
based  solely  on  periodic  state  in¬ 
formation  transmitted  from  the  ac¬ 
tive  component(s)  to  the  standby 
component  and  involves  the  Roii- 
back  tactic 

Allows  a  system’s  control 
element  to  dynamically  re¬ 
cover  its  control  plane  state 
from  its  network  peers 

Periodically  compares  state 
of  the  active  and  standby 
components  to  ensure 
synchronization 

Graceful  Restart 

Includes  inter-networking  devices 
such  as  cross-connects,  switches, 
and  packet  routers 

Allows  a  system’s  control 
element  to  dynamically  re¬ 
cover  its  control  plane  state 
from  its  network  peers 

Rollback 

Checkpoint  based 

Allows  the  system  state  to  be 
reverted  to  the  most  recent 
consistent  set  of  checkpoints 

Coordinated 

Checkpointing 

Allows  processes  to  resolve  depen¬ 
dencies  and  restart  at  a  coordinated 
checkpoint 

More  complex  mechanism 
that  is  always  consistent 
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Tactic 

Control/click  on  tactic  to  go 
to  its  description  in  report. 

Mechanism 

Result 

Uncoordinated 

Checkpointing 

Aiiows  processes  to  take  check¬ 
points  when  most  convenient 

Simpie  mechanism  but  may 
not  aiways  resuit  in  a  consis¬ 
tent  state 

Escalating  Restart 

Varies  the  granuiarity  of  the  compo- 
nent(s)  restarted  and  minimizes  the 
ievei  of  service  affectation 

At  finai  ievei,  compieteiy 
reioads  and  reinitiaiizes  the 
executabie  image  and  asso¬ 
ciated  data  segment 

Non-Stop 

Forwarding 

Inciudes  inter-networking  devices 
such  as  cross-connects,  switches, 
and  packet  routers 

Maintains  proper  functioning 
of  the  user  data  piane  (bear¬ 
er  channei  services) 

Other  Preparation  and  Repair  tactics  exist  that  are  primarily  realized  hy  a  hardware  design  and, 
hence,  heyond  the  scope  of  this  work.  Examples  include  Error  Detection  and  Correction 
(ED  AC)  coding.  Forward  Error  Correction  (EEC),  and  Temporal  Redundancy.  ED  AC  coding  is 
typically  used  to  protect  control  memory  structures  in  high-availahility  distributed  real-time  em¬ 
bedded  systems  [Hamming  1980].  Conversely,  EEC  coding  is  typically  employed  to  recovery 
from  physical  layer  errors  occurring  on  external  network  links  [Morelos-Zaragoza  2006].  Tem¬ 
poral  Redundancy  involves  sampling  spatially  redundant  clock  or  data  lines  at  time  intervals  that 
exceed  the  pulse  width  of  any  transient  pulse  to  be  tolerated,  and  then  voting  out  any  defects  de¬ 
tected  [Mavis  2002]. 

Some  Fault  Recovery  tactics  rely  on  component  reintroduction,  where  a  failed  component  is  rein¬ 
troduced  after  it  has  been  corrected.  Reintroduction  tactics  include  Shadow,  State  Resynchroniza¬ 
tion/Graceful  Restart,  Rollback,  Escalating  Restart,  and  Non-Stop  Forwarding. 

The  Shadow  tactic  refers  to  operating  a  previously  failed  or  in-service  upgraded  component  in  a 
“shadow  mode”  for  a  pre-defined  duration  of  time  prior  to  reverting  the  component  back  to  an 
active  role.  In  this  context,  the  Shadow  tactic  is  a  Reintroduction  version  of  the  Hitless  In-Service 
Software  Upgrade  tactic  previously  discussed  as  a  Preparation  and  Repair  tactic. 

Similarly,  State  Resynchronization  is  a  reintroduction  refinement  to  the  Active  Redundancy  and 
Passive  Redundancy  preparation  and  repair  tactics.  When  realized  as  a  refinement  to  the  Active 
Redundancy  tactic,  the  State  Resynchronization  occurs  organically,  as  the  active  and  standby 
components  each  receive  and  process  identical  inputs  in  parallel.  In  practice,  the  states  of  the  ac¬ 
tive  and  standby  components  are  periodically  compared  to  ensure  synchronization.  This  compari¬ 
son  may  be  based  on  a  cyclic  redundancy  check  calculation  [Morelos-Zaragoza  2006]  or,  for  sys¬ 
tems  providing  safety-critical  services,  a  message  digest  calculation  (a  one-way  hash  function) 
[Schneier  2001].  Conversely,  when  realized  as  a  refinement  to  the  Passive  Redundancy  (warm 
sparing)  tactic.  State  Re  synchronization  is  based  solely  on  periodic  state  information  transmitted 
from  the  active  component(s)  to  the  standby  component(s).  This  operation  involves  an  additional 
reintroduction  tactic  referred  to  as  Rollback.  Rollback  is  a  checkpoint-based  recovery  mechanism 
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that  allows  the  system  state  to  be  reverted  to  the  most  recent  consistent  set  of  checkpoints.  The 
set  of  checkpoints  is  referred  to  as  a  “recovery  line,”  which  may  be  generated  using  graph- 
theoretic  techniques  described  by  Elnozahy  [Elnozahy  2002].  Checkpoint-based  rollback  recov¬ 
ery  may  employ  Uncoordinated  Checkpointing,  where  processes  are  allowed  to  take  checkpoints 
when  most  convenient,  or  Coordinated  Checkpointing,  where  processes  resolve  dependencies  and 
restart  at  a  coordinated  checkpoint  [Scott  2008].  A  specialization  of  State  Resynchronization, 
used  in  tandem  with  the  Non-Stop  Forwarding  tactic  is  Graceful  Restart,  which  allows  a  system’s 
control  element  to  dynamically  recover  its  control  plane  state  from  its  network  peers.  Standard 
realizations  of  Graceful  Restart  have  emerged  for  a  variety  of  commonly  deployed  routing  and 
signaling  protocols  such  as  BGP  [lETE  2007b,  2007c],  OSPF  [lETE  2007a] [lETE  2008],  LDP 
[lETE  2003b]  and  RSVP-TE  [lETE  2003a,  2003c]. 

Escalating  Restart  is  a  Reintroduction  tactic  that  allows  the  system  to  recover  from  faults  by  vary¬ 
ing  the  granularity  of  the  component(s)  restarted  and  minimizing  the  level  of  service  affectation 
[Utas  2005].  Eor  example,  consider  a  system  that  supports  four  levels  of  restart,  as  follows.  The 
lowest  level  of  restart  (call  it  Level  0),  and  hence  least  impacting  on  services,  employs  Passive 
Redundancy  (warm  restart),  where  all  child  threads  of  the  component  in  which  the  fault  was  de¬ 
tected  are  killed  and  recreated.  In  this  way,  only  data  associated  with  the  child  threads  is  freed 
and  reinitialized.  The  next  level  of  restart  (Level  1)  frees  and  reinitializes  all  unprotected  memory 
(protected  memory  would  remain  untouched).  The  next  level  of  restart  (Level  2)  frees  and  reini¬ 
tializes  all  memory,  both  protected  and  unprotected,  forcing  all  applications  to  reload  and  reini¬ 
tialize.  And  the  final  level  of  restart  (Level  3)  would  involve  completely  reloading  and  reinitializ¬ 
ing  the  executable  image  and  associated  data  segments.  Support  for  the  Escalating  Restart  tactic 
is  particularly  useful  for  the  concept  of  graceful  degradation,  where  a  system  is  able  to  degrade 
the  services  it  provides  while  maintaining  support  for  mission-critical  or  safety-critical  applica¬ 
tions. 

Another  Reintroduction  tactic  used  to  enable  graceful  degradation  of  high-availability  systems  is 
Non-Stop  Eorwarding  (NSL).  This  concept  is  borrowed  from  commercial  best  current  practice  in 
designing  inter-networking  devices,  such  as  cross-connects,  switches,  and  packet  routers.  Non- 
Stop  Eorwarding  refers  to  the  ability  of  the  device  to  maintain  proper  functioning  of  the  user  data 
plane  (bearer  channel  services),  even  when  the  device’s  control  and/or  management  planes  are  out 
of  service.  Support  for  Non-Stop  Eorwarding  implies  a  strict  separation  of  control/management 
and  data  plane  functionality  in  the  system  design,  as  described  through  the  Internet  Engineering 
Task  Force  [IETF  2004].  In  heritage  digital  cross-connect  (DXC)  and  ATM  switch  design,  this 
tactic  is  referred  to  as  the  “Headless  Mode”  of  operation.  The  term  Non-Stop  Eorwarding  has 
emerged  as  the  standard  nomenclature  used  when  the  Headless  Mode  tactic  is  applied  to  packet 
router  designs  targeting  high-availability  services. 

2.4  Fault  Prevention  Tactics 

Eault  Prevention  tactics  include  Removal  from  Service,  Transactions,  Process  Monitor,  and  Ex¬ 
ception  Prevention.  In  the  context  of  fault  prevention,  the  Removal  from  Service  tactic  refers  to 
placing  a  system  component  in  an  out-of-service  state  for  the  purpose  of  mitigating  potential  sys¬ 
tem  failures.  One  example  involves  taking  a  component  of  a  system  out  of  service  and  resetting 
the  component  in  order  to  scrub  latent  faults  (such  as  memory  leaks,  fragmentation,  or  soft  errors 


16  I  CMU/SEI-2009-TR-006 


in  an  unprotected  cache)  before  the  accumulation  of  faults  become  service  affecting  (resulting  in 
system  failure). 

Systems  targeting  high-availability  services  leverage  transactional  semantics  to  ensure  that  asyn¬ 
chronous  messages  exchanged  between  distributed  components  are  atomic,  consistent,  isolated, 
and  durable  [Gray  1993].  These  four  properties  are  referred  to  as  the  “ACID  properties”  and  are 
generally  a  requirement  for  high-availability  systems,  particularly  those  that  provide  either 
mission-critical  or  safety-critical  services.  The  Transactions  tactic  is  typically  realized  using  an 
“atomic  commit  protocol,”  the  most  common  of  which  is  the  “two-phase  commit”  (a.k.a.  2PC) 
protocol,  originally  described  by  Gray  [Gray  1993].  Figure  4  illustrates  the  successful  case  of  a 
distributed  two-phase  commit  transaction.  For  the  case  where  a  distributed  two-phase  commit 
transaction  fails,  the  transaction  coordinator  will  employ  the  Rollback  tactic  among  all  distributed 
components  involved  in  the  failed  transaction,  in  order  to  ensure  a  consistent  and  durable  system 
state. 


Transaction 

Coordinator 


Coordinator  State  = 
Active 


Worker 


Phase  1  Request; 
Prepare  Transaction 


Worker  State  = 
Active 


Coordinator  State  = 
Prepared 


Worker  State  = 
Prepared 


Coordinator  State  = 
Committing 


Phase  1  Response: 
OK 


Phase  2  Request: 
Commit  Transaction 


Worker  State  = 
Committing 


Coordinator  State  = 
Committed 


Phase  2  Response: 
ACKnowledge 


Worker  State  = 
Committed 


Figure  4:  Transactions  Tactic:  Two-Phase  Commit 
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In  the  context  of  fault  prevention,  the  Process  Monitor  tactic  is  employed  to  monitor  the  state  of 
health  (SOH)  of  a  system  process  in  order  to  ensure  that  the  system  is  operating  within  its  nominal 
operating  parameters.  This  tactic  has  a  symbiotic  relation  to  the  System  Monitor  previously  de¬ 
scribed,  and  in  practice,  the  Process  Monitor  may  be  a  lower  level  function  of  a  hierarchical  sys¬ 
tem  monitoring  function. 

Recall  from  the  previous  discussion  that  the  Exception  tactic  has  been  refined  into  Exception  De¬ 
tection,  Exception  Handling,  and  Exception  Prevention  tactics,  with  Exception  Detection  de¬ 
scribed  in  Section  2.2  and  the  Exception  Handling  tactic  described  in  Section  2.3.  The  Exception 
Prevention  tactic  refers  to  techniques  employed  for  the  purpose  of  preventing  system  exceptions 
from  occurring.  The  use  of  Exception  Classes,  which  allows  a  system  to  transparently  recovery 
from  system  exceptions,  was  discussed  in  Section  2.3  [Powell-Douglas  1999].  Other  examples  of 
tactical  refinements  used  to  realize  Exception  Prevention  include  abstract  data  types  such  as  Smart 
Pointers  and  the  use  of  Wrappers  [Gamma  1995]  to  prevent  faults  such  as  dangling  pointers  and 
semaphore  access  violations  from  occurring. 


Table  4:  Availability  Tactics:  Fault  Prevention 


Tactic 

Control/click  on  tactic  to  go 
to  its  description  in  report 

Mechanism 

Result 

Removal  from 

Service 

Piaces  a  system  component  in  an 
out-of-service  state 

Aiiows  for  mitigating  potentiai  sys¬ 
tem  faiiures  before  their  accumu- 
iation  affects  service 

T  ransactions 

Ensures  a  consistent  and  durabie 
system  state 

Atomic  Commit 
Protocoi 

Most  commoniy  the  two-phase 
commit 

Process  Monitor 

Monitors  the  state  of  heaith  of  a 
system  process;  ensures  that  the 
system  is  operating  within  its  no- 
minai  operating  parameters 

Exception 

Prevention 

Prevents  system  exceptions  from 
occurring 

Exception  Ciasses 

Contain  information  heipfui  in  fauit 
correiation,  such  as  the  name  of 
the  exception  thrown,  the  origin  of 
the  exception,  and  the  cause  of  the 
exception  thrown 

Aiiows  system  to  transparency 
recover  from  system  exceptions 

Smart  Pointers/ 
Wrappers 

Abstract  data  types  that  controi  aii 
access  through  pointers 

Prevent  fauits  such  as  dangiing 
pointers  and  semaphore  access 
vioiations 
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3  An  Example 


We  will  now  present  an  example  from  the  internetworking  domain,  where  a  network  node  is  re¬ 
sponsible  for  providing  high-availahility  services,  such  as  cross-connecting  voice  circuits,  switch¬ 
ing  cells,  frames,  or  packets,  or  routing  and  forwarding  IP  packets. 

The  system  architect,  knowing  that  high  availability  is  crucial,  has  determined  that  some  form  of 
redundancy  will  be  employed.  The  architect’s  next  step  is  to  determine  and  apply  the  appropriate 
redundancy  tactics  to  ensure  compliance  with  the  system’s  precise  availability  requirements. 
Architectural  insight  can  and  should  be  developed  through  model-based  analysis  wherever  possi¬ 
ble.  Consider  the  case  of  an  architecture  being  developed  for  a  high-availability  system  having  a 
99.999%  (a.k.a.  “5  nines”)  availability  requirement.  We  will  show  how  we  employ  a  Markov 
model  to  determine  the  appropriate  redundancy  tactic  to  employ  in  the  system  architecture. 

3.1  The  Availability  Model 

The  availability  model,  from  Gokhale,  employed  and  illustrated  in  Figure  5,  takes  into  account  the 
Mean  Time  Between  Failure  and  Mean  Time  To  Recover  for  both  hardware  and  software  compo¬ 
nents  of  the  system  [Gokhale  2005].  In  addition,  failures  are  characterized  according  to  high  and 
low  severity  levels.  Each  of  the  hardware  and  software  components  can  be  in  one  of  three  states: 
In-Service,  Degraded  Service,  or  Out-of-Service.  In-Service  implies  nominal  operation  with  no 
faults  having  been  detected  and  with  service  levels  consistent  with  the  components’  specifica¬ 
tions.  Degraded  Service  implies  that  a  severity  2  (low  priority)  fault  has  been  detected  and  corre¬ 
lated  and  is  in  the  process  of  being  mitigated.  Mitigation  strategies  for  severity  2  faults  may  range 
from  resetting  ASIC  personalities  or  software  processes  to  executing  a  managed  fail-over  to  a  re¬ 
dundant  processor.  Out-of-Service  implies  that  a  severity  1  (high  priority)  fault  has  been  detected 
and  correlated  and  is  in  the  process  of  being  mitigated.  Mitigation  of  severity  1  faults  typically 
involves  a  managed  fail-over  to  a  redundant  processor.  The  combination  of  three  states  for  hard¬ 
ware  and  software  components  results  in  a  state  space  of  nine  states  in  the  Markov  model.  The 
simultaneous  severity  1  failure  state  (i.e.,  severity  1  faults  in  both  hardware  and  software)  is  not 
considered  to  be  a  valid  state,  as  the  fault  detection  and  correlation  logic  would  assign  the  (root 
cause)  fault  to  either  software  or  hardware,  but  not  both. 

The  model  assumes  that  MTBF  values  for  all  four  types  of  failures  are  exponentially  distributed. 
MTBF  values  for  severity  1  and  2  hardware  and  software  failure  states  are  denoted  in  the  model 
using  the  {^hw-sevi,  ^hw-sev2,  ^sw-sevi,  ^sw-sevi}  nomenclature.  Similarly,  the  model  assumes 
that  MTTR  values  for  recovering  from  the  four  types  of  failures  are  exponentially  distributed. 
MTTR  values  for  recovering  from  severity  1  and  2  hardware  and  software  failure  states  are  de¬ 
noted  by  the  (phw-sevi,  Fhw-sevi,  Fsw-sevi,  Fsw-sevi}  nomenclature.  The  steady-state  system 
availability  (A)  is  determined  by  adding  the  probabilities  of  the  system  operating  in  a  nominal  In- 
Service  state  and  the  various  combinations  of  In-Service  and  Degraded  Service  states. 

A  =  P{HW,,  :  5W„)  -h  P{HW,,  :  SW,,y,)+P{HWs,y, :  5W,,)  +P{HW,,y, : 
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In  order  to  solve  the  preceding  equation,  a  system  of  balance  equations  is  derived  from  the  availa¬ 
bility  model.  The  derivation  of  the  balance  equations  and  resulting  solution  for  this  model  are 
found  in  Gokhale  [Gokhale  2005]. 

Table  5  provides  an  analysis  of  system  availability  for  three  separate  Redundancy  tactics:  Active 
(hot  sparing),  Passive  (warm  sparing),  and  Spare  (cold  sparing).  The  analysis  assumes  that  soft¬ 
ware  failures  are  five  times  more  likely  to  occur  than  hardware  failures  and  that  severity  2  failures 
are  also  five  times  more  likely  to  occur  than  severity  1  failures.  The  analysis  assumes  that  the 
MTTR  for  severity  2  failures  is  30  seconds  for  each  scenario  (active,  passive,  and  spare)  and  that 
the  MTTR  for  severity  1  failures  is  one  second  for  Active  Redundancy,  five  seconds  for  Passive 
Redundancy,  and  15  minutes  for  the  (cold  sparing)  Spare  tactic.  Wherever  possible  such  assump¬ 
tions  can  and  should  be  validated  by  measurements  of  prototypes  or  existing  systems. 

From  this  analysis,  the  system  architect  is  able  to  determine  that  an  availability  requirement  of 
99.999%  can  be  met  using  the  Passive  Redundancy  (warm  sparing)  tactic,  albeit  with  no  addition¬ 
al  margin.  Note  also  that  this  analysis  also  indicates  the  difficulty  in  achieving  a  more  stringent 
99.9999%  (a.k.a.  “6  nines”)  of  availability,  even  when  the  more  complex  and  more  costly  Active 
Redundancy  tactic  is  used. 


HWb:SW,s 

\w-sev2 

SW;  In-Service 

Msev2/ 


►  j 

HWsEwiSWe 

'Nw-sevl 

HW;  Degraded  Service 
SW;  In-Service 

'^'sw-sevl 

'^sw-sev2 


^  HWsevi:SW,s  ' 

■^sevl  / 

1  Msevi 

^  HWb:SWsevi  ^ 

HW:  Out-of-Service 
SW;  In-Service 

k  J 

HW:  In-Service 

SW;  Out-of-Service 
_ / 

Msev1 

\  Msev1 

HWsevjiSWsevi 

HWs£vi:SWs«v2 

HW;  Degraded  Service 
SW:  Out-of-Service 

V  J 

HW:  Out-of-Service 

SW:  Degraded  Service 

N.'^'sw-sevl 


HWis:SWs€V2 


HW:  In-Service 

\iw-sevi  '  Degraded  Service 


'NTA'-sev2 


Msev2 


\iw-sevl/ 


HWscv2:SWsev2 

'  HW:  Degraded  Service 
SW:  Degraded  Service 


HWscvi:SWs€vi 


HW:  Out-of-Service 
SW:  Out-of-Service 


Figure  5:  Markov  Model  of  System  A  vailability 
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Table  5:  System  Availability  of  Redundancy  Tactic  Used 


Function 

Failure  Severity 

MTBF  (Hours) 

MTTR  active  (Sec) 

MTTR  passive  (Sec) 

MTTR  sparing  (Sec) 

Hardware 

1 

250,000 

1 

5 

900 

2 

50,000 

30 

30 

30 

Software 

1 

50,000 

1 

5 

900 

2 

10,000 

30 

30 

30 

Availability 

0.999998 

0.999990 

0.9982 

3.2  The  Resulting  Redundancy  Tactic 

In  this  example  we  achieved  equipment  protection  via  the  Passive  Redundancy  (warm  sparing) 
tactic,  where  the  active  processor(s)  in  a  protection  group  receives  input  traffic  and  transmits 
checkpointed  state  periodically  to  the  associated  redundant  spare(s).  In  practice,  this  tactic 
achieves  a  reasonable  balance  between  design  complexity  and  performance  for  MTTR.  Systems 
designed  for  high  availability  (99.999%  uptime)  commonly  employ  Passive  Redundancy.  Con¬ 
versely,  systems  with  more  stringent  availability  requirements  will  necessarily  employ  the  more 
complex  Active  Redundancy  (hot  sparing)  tactic,  to  further  reduce  MTTR. 


Figure  6:  Passive  Redundancy  Tactics 

A  realization  of  the  Passive  Redundancy  tactic,  embedded  in  a  pattern,  is  illustrated  in  Figure  6, 
where  each  of  the  boxes  is  a  system  node.  The  pattern  employs  four  functions:  the  Active  Node, 
the  Redundant  (passive/warm  spare)  Node,  a  Redundancy  Manager,  and  a  System  Journal  (a  jour¬ 
naling  mechanism).  The  Active  and  Redundant  Nodes  are  identical  copies  of  a  system  processor 
(or  collection  of  processors).  The  Active  Node  periodically  transmits  checkpointed  state  data  to 
the  Redundant  Node,  to  maintain  a  loosely  synchronized  state.  The  System  Journal  is  employed 
to  ensure  that  transient  state  not  included  in  the  checkpoint  data  set  is  not  lost  as  a  result  of  a  role 
change  (where  the  Redundant  Node  assumes  the  Active  role).  The  Redundancy  Manager  is  used 
to  manage  the  assignment  of  (Active,  Spare)  roles  to  the  processors  in  the  protection  group  and 
detect  the  “liveness”  of  the  currently  designated  Active  Node,  to  identify  the  recovery  line  (i.e., 
the  most  recent  consistent  set  of  checkpointed  data),  and  to  manage  the  transmission  of  journaled 
data  from  the  System  Journal  to  a  Redundant  Node  undergoing  a  role  change  (when  assuming  the 
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role  of  an  Active  Node).  This  pattern  may  additionally  employ  a  Reintroduction  tactic  such  as 
Coordinated  Checkpointing,  where  the  processes  involved  autonomously  resolve  checkpoint  de¬ 
pendencies  and  restart  at  a  coordinated  checkpoint,  or  may  use  Uncoordinated  Checkpointing, 
where  each  process  is  allowed  to  save  a  checkpoint  when  it  is  most  convenient.  As  described  in 
Elnozahy,  Uncoordinated  Checkpointing  achieves  a  lower  latency  recovery  for  the  nominal  case 
but  can  be  subjected  to  resolving  a  cascade  of  checkpoint  dependency  issues  for  the  off-nominal 
case  (referred  to  as  the  “Domino  Effect”)  [Elnozahy  2002].  Coordinated  Checkpointing  elimi¬ 
nates  the  Domino  Effect  at  a  cost  of  slightly  higher  latency  for  each  recovery  (due  to  the 
processing  penalty  associated  with  central  management  of  the  distributed  checkpoint  data).  An 
excellent  survey  of  best  practices  for  checkpoint  rollback-recovery  protocols  is  provided  by 
Elnozahy  [Elnozahy  2002]. 

Eor  cases  where  either  a  singleton  spare  is  used  to  protect  multiple  active  nodes  (1:N,  pronounced 
“one  for  N”)  or  multiple  spares  are  used  to  protect  multiple  active  nodes  (M:N),  the  checkpoint 
and  journal  data  can  be  saved  to  persistent  (or  redundancy-protected)  memory,  with  the  Redun¬ 
dancy  Manager  directing  the  Redundant  Node  as  to  which  set  of  checkpoint  and  journaled  data  to 
retrieve  upon  a  role  change. 

In  addition  to  simplified  design  complexity  (when  compared  to  the  stricter  state  synchronization 
requirements  imposed  by  the  Active  Redundancy  tactic),  we  find  that  the  time  overhead  for  the 
nominal  case  of  Passive  Redundancy  is  low,  as  it  is  a  function  solely  of  the  time  required  to  ex¬ 
port  the  journaled  (transient,  uncheckpointed)  data  from  the  Active  Node  and  to  periodically  cal¬ 
culate  a  new  recovery  line,  save  the  data  as  a  checkpoint,  and  transmit  the  checkpointed  state  to 
the  Redundant  Node.  Similarly,  the  space  overhead  imposed  by  this  tactic  is  low,  bounded  by  the 
checkpointed  state  residing  on  the  Active  and  Redundant  Nodes  and  the  transient  journaled  data 
maintained  on  the  Active  Node  and  on  the  node  hosting  the  System  Journal.  And  finally,  the 
communication  overhead  imposed  by  this  tactic  for  the  nominal  case  is  also  low,  bounded  by  the 
control  messaging  used  to  register  with  the  System  Journal  and  for  the  Redundancy  Manager  to 
(re)assign  roles,  and  the  replication  of  checkpoint  and  journaled  data  between  Active  Node  and 
the  Redundant  Node  and  System  Journal,  respectively  [Saridakis  2002]. 

3.3  Tactics  Guide  Architectural  Decisions 

The  system  architect  needs  to  determine  the  appropriate  availability  tactic(s)  to  employ  based  on  a 
consideration  of  the  MTTR  requirements  for  the  various  network  services  enabled  by  the  device, 
as  well  as  the  side  effects  of  the  chosen  tactics.  Eor  example,  if  a  given  processor  in  the  system 
hosts  a  service  with  a  requirement  that  service  outages  be  repaired  on  sub-second  timescales,  then 
Active  Redundancy  (hot  sparing)  is  the  suitable  availability  tactic.  Conversely,  if  that  processor 
hosts  a  service  that  requires  that  outages  be  repaired  in  timescales  measured  in  seconds,  then  Pas¬ 
sive  Redundancy  (warm  sparing)  may  be  employed.  And,  if  the  system  requirements  allow  for 
service  outages  to  be  repaired  in  timescales  measured  in  minutes,  then  Sparing  (cold  sparing)  is  a 
suitable  availability  tactic  to  employ.  Note  also  that  some  systems  have  both  stringent  availability 
and  reliability  requirements,  in  which  case  it  may  be  necessary  to  employ  either  Active  Redundan¬ 
cy  or  Passive  Redundancy  (to  address  the  MTTR  component  of  the  availability  requirement)  in 
tandem  with  Sparing  (to  address  system  reliability  requirements  relating  to  Mean  Mission  Dura¬ 
tion).  Each  of  these  tactics  has  side  effects:  Active  Redundancy  consumes  more  runtime  resources 
(processing  and  communication)  than  Passive  Redundancy,  to  keep  the  redundant  components 


22  I  CMU/SEI-2009-TR-006 


synchronized.  And  both  Active  and  Passive  Redundancy  increase  the  cost  of  the  system  more  than 
Sparing.  In  each  case  the  availability  requirement,  along  with  performance  requirements  and  con¬ 
straints  on  costs,  guides  the  architect  to  a  choice  of  tactics,  even  though  the  analysis  is  initially  at 
a  crude,  often  heuristic,  level. 

In  determining  the  appropriate  availability  tactic(s)  to  employ,  the  system  architect  must  consider 
the  system  availability  require ment(s)  and  any  associated  availability  sub-allocations  for  the  con¬ 
stituent  system  components.  Recall  from  the  previous  discussion  that  availability  extends  the 
concept  of  system  reliability  (defined  by  the  MTBF)  to  include  the  notion  of  Mean  Time  to 
Recovery  (MTTR).  It  is  the  system  MTTR  (or  associated  MTTR  values  sub-allocated  to  the 
system’s  constituent  components)  that  determine  the  set  of  appropriate  availability  tactics  to 
consider. 

For  the  high-availability  example  provided,  various  architectural  alternatives  could  be  considered. 
For  example,  rather  than  employ  the  Passive  Redundancy  tactic,  the  architect  could  specify  the 
more  stringent  Active  Redundancy  tactic,  which  would  provide  a  greater  capability  with  regard  to 
system  availability  but  at  the  risk  of  over-engineering  the  system  (increasing  cost  and  complexity) 
for  the  given  set  of  requirements.  Once  Passive  Redundancy  had  been  selected  as  the  suitable 
tactic,  the  size  of  the  protection  group  had  to  be  considered.  For  example,  would  the  system  em¬ 
ploy  a  1 : 1  form  of  equipment  protection,  where  a  single  warm  spare  is  employed  for  each  active 
processor,  or  will  the  system  employ  a  1  :N  (one  redundant  spare  used  for  N  active  processors) 
form,  or  even  an  M:N  (M  redundant  spares  used  to  protect  N  active  processors)  form  of  equip¬ 
ment  protection? 

Next,  once  the  sparing  model  has  been  selected,  the  design  space  includes  multiple  options  for 
realizing  the  Reintroduction  tactic.  For  example,  transactional  semantics  can  be  employed  to  en¬ 
sure  that  the  ACID  properties  are  supported  by  the  system’s  distributed  database  (which  does  not 
imply  use  of  a  relational  database).  Alternately,  a  checkpointing  strategy  could  be  used,  leverag¬ 
ing  either  Coordinated  Checkpointing  or  Uncoordinated  Checkpointing  along  with  one  of  the 
rollback  recovery  protocols  described  by  Elnozahy  [Elnozahy  2002].  Eor  networking  devices  that 
host  link-state  protocols,  which  support  the  notion  of  peering  with  its  neighboring  nodes.  Reintro¬ 
duction  tactics  such  as  Graceful  Restart  and  Non-Stop  Forwarding  could  be  employed  in  order  to 
reacquire  control  plane  state  (from  its  peer)  while  maintaining  fault-free  user  data  services. 

Verifying  that  the  collection  of  tactics  employed  results  in  a  system  that  complies  with  its  availa¬ 
bility  requirements  involves  several  levels  of  analysis.  In  practice,  the  system  availability  re¬ 
quirement  is  decomposed  into  separate  sub-allocations  for  hardware-  and  software-induced  fail¬ 
ures.  Eor  hardware,  a  common  approach  for  estimating  system  reliability  (the  MTBE  component 
of  availability)  is  to  employ  a  Markov  Chain  to  model  the  interconnection  of  the  various  hardware 
components,  where  the  failure  rate  function  for  a  given  component  follows  a  Weibull  distribution. 
Modeling  the  system  reliability  for  software -induced  failures  differs  from  the  above  approach  due 
to  the  failure  rate  function  for  the  constituent  component’s  being  more  accurately  represented  by  a 
Poisson  distribution,  due  to  the  periodic  introduction  of  new  software  releases.  Popular  analytical 
models  used  to  estimate  software  reliability  are  based  on  Markov  Chains,  Non-Homogeneous 
Poisson  Processes  (NHPP),  and  Bayesian  formulations  [Musa  1987,  Xie  1991]. 

The  contribution  of  both  hardware-  and  software -induced  failures  to  system  availability  can  be 
computed  as  the  ratio  of  uptime  to  the  sum  of  uptime  and  downtime,  as  the  time  interval  over 
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which  the  measurement  is  made  approaches  infinity  (as  described  in  Section  3).  Note  that  the 
downtime  is  the  product  of  the  failure  intensity  and  the  MTTR.  To  reiterate,  only  service- 
affecting  failures  contrihute  to  the  system  availability. 
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4  Implications 


We  can  draw  a  number  of  implications  from  this  example.  The  most  important  is  that  while  archi¬ 
tectural  design,  even  for  a  fragment  of  a  system,  is  a  complex  search  through  a  potentially  un¬ 
bounded  space  of  possibilities,  the  use  of  tactics  guides  and  constrains  this  search  and  makes  it  far 
more  tractable.  Each  tactic  can  be  associated  with  an  analytic  model  (e.g.,  see  the  discussion  in 
Section  3.1).  So  tactics  work  on  two  levels:  similar  to  architectural  patterns  they  guide  the  archi¬ 
tect  towards  a  particular  solution — a  choice  of  a  tactic — ^but  unlike  architectural  patterns  they  sup¬ 
port  a  deeper  (more  precise)  form  of  analysis  based  on  analytic  models.  As  shown  in  our  example, 
these  analyses  may  range  from  guidelines  and  heuristics  to  precise  mathematical  models,  as  dic¬ 
tated  by  the  level  of  risk  surrounding  the  architecture’s  realization  of  a  quality  attribute. 

Tactics  are  the  building  blocks  of  patterns.  A  pattern,  such  as  the  one  shown  in  Figure  6,  is  a 
composition  of  multiple  tactics:  System  Monitor,  Passive  Redundancy,  State  Resynchronization, 
Rollback,  and  so  on.  Each  of  these  may  in  turn  be  decomposed  into  even  lower  level  tactics 
{Heartbeat,  Coordinated  Checkpoint,  Graceful  Restart,  Non-Stop  Forwarding,  etc.). 
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5  Conclusions 


This  report  has  presented  an  update  to  the  catalog  of  architectural  tactics  for  availability.  We  have 
a  similar  update  to  the  catalog  for  performance  that  space  prevents  us  from  presenting  here. 

These  updates  have  been  motivated  by  practice — by  observing,  and  categorizing,  the  sets  of  tac¬ 
tics  in  actual  use.  The  structure  of  neither  the  availability  nor  the  performance  tactic  catalog  has 
changed  dramatically  since  they  were  introduced  seven  years  ago;  this  shows  that  the  notion  of 
tactics  is  robust — they  are  fundamental  elements  of  design.  Only  the  realizations  of  the  tactics 
have  changed,  and  this  is  to  be  expected  as  technologies  mature. 

This  report  has  also  shown  how  the  catalog  of  tactics  can  be,  and  is,  used  in  practice,  to  guide  in 
making  fundamental  architectural  design  decisions  that  have  implications  in  multiple  dimensions. 
For  each  tactics-based  design  decision  there  are  heuristics  associated  with  the  decision,  as  well  as 
associated  analytic  models. 

From  this  presentation  we  can  see  that  tactics  are  useful  in  both  design  and  analysis.  They  are  use¬ 
ful  because  they  restrict  the  design  and  analysis  vocabulary,  reduce  the  size  of  the  search  space, 
and  directly  suggest  analytic  models. 
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